Neural network-based method for diagnosis and severity assessment of Graves’ orbitopathy using orbital computed tomography

Computed tomography (CT) has been widely used to diagnose Graves’ orbitopathy (GO), and its utility is gradually increasing. To develop a neural network (NN)-based method for the diagnosis and severity assessment of GO using orbital CT, an NN optimized for diagnosing GO was designed and trained using 288 orbital CT scans obtained from patients with mild and moderate-to-severe GO and from normal controls. The developed NN was compared with three conventional NNs [GoogLeNet Inception v1 (GoogLeNet), 50-layer Deep Residual Learning (ResNet-50), and the 16-layer Very Deep Convolutional Network from the Visual Geometry Group (VGG-16)]. The diagnostic performance was also compared with that of three oculoplastic specialists. The developed NN had an area under the receiver operating characteristic curve (AUC) of 0.979 for diagnosing patients with moderate-to-severe GO. Receiver operating characteristic (ROC) analysis yielded AUCs of 0.827 for GoogLeNet, 0.611 for ResNet-50, 0.540 for VGG-16, and 0.975 for the oculoplastic specialists for diagnosing moderate-to-severe GO. For the diagnosis of mild GO, the developed NN yielded an AUC of 0.895, which was better than the performance of the other NNs and of the oculoplastic specialists. This study may contribute to NN-based interpretation of orbital CTs for diagnosing various orbital diseases.


Scientific Reports | (2022) 12:12071 | https://doi.org/10.1038/s41598-022-16217-z | www.nature.com/scientificreports/

analysis, they may be insufficient for diseases that require multiple specialized images to detect structural changes, such as orbital disorders. Diagnosis of diseases by applying NN-based methods to CT images is already being attempted for a few organs such as the lungs and brain 14,15. However, NN-based techniques using orbital CT have rarely been investigated for diagnosing GO, and their usefulness remains to be established. Therefore, in this study, we developed and evaluated a new NN that can be applied to GO diagnosis and severity assessment by analyzing orbital CT images.

Results
The average age of the patients with GO was higher than that of the normal controls, and the average age of patients with moderate-to-severe GO was higher than that of the mild GO patients (p < 0.001). The gender ratios also differed among the groups, with a higher proportion of women in the control group. The margin reflex distance 1 (MRD1) and exophthalmos differed significantly between the groups (p < 0.001). The MRD1 was significantly higher in moderate-to-severe GO patients than in the other groups, and significantly higher in mild GO patients than in controls. Exophthalmos was significantly greater in moderate-to-severe GO patients than in mild GO patients and controls. There was no significant difference in exophthalmos between mild GO patients and controls. Table 1 presents the demographic characteristics of the groups.
Table 2 presents the experimental results of the proposed and conventional NNs, reporting the values averaged over 30 repeated experiments. The results indicate that the proposed method achieved an AUC of 0.905 for moderate-to-severe GO patients vs. mild GO patients vs. normal controls. For moderate-to-severe GO patients vs. normal controls, the proposed NN achieved an AUC of 0.979, whereas it yielded a lower AUC of 0.895 for mild GO patients vs. normal controls. Notably, combining the axial, coronal, and sagittal plane images produced a higher AUC than using CT images of only one or two planes. In addition, when CT images of only one or two planes were used, the axial plane images tended to produce a higher AUC than the other planes for mild GO patients vs. normal controls. On the other hand, the sagittal plane images tended to produce a higher AUC than the other planes for moderate-to-severe GO patients vs. mild GO patients and moderate-to-severe GO patients vs. 
The performance of GoogLeNet, which had the best diagnostic performance among the three conventional NNs, was much lower than that of the proposed model (p < 0.001). Its AUC for discriminating between moderate-to-severe GO patients vs. mild GO patients vs. normal controls was 0.673. In the diagnosis of moderate-to-severe GO patients vs. normal controls, ROC curve analysis yielded AUCs of 0.827 for GoogLeNet, 0.611 for ResNet-50, and 0.540 for VGG-16. Since each of the conventional NNs has only one input node, only single-plane CT images were selected for each analysis.
Furthermore, we conducted an ablation study of the NN structure by changing the primary components of a baseline NN step by step until the proposed structure was reached. Table 3 presents the results of the ablation study, reporting the AUC values averaged over ten repeated experiments for moderate-to-severe GO patients vs. mild GO patients vs. normal controls. Specifically, the Stage1 NN is the baseline structure, consisting of three standard convolutional layers and one fully connected layer, and provides the baseline performance. The Stage2 NN, which includes depthwise convolutions, exhibited a considerable AUC improvement over the Stage1 NN. Compared with the Stage2 NN, the Stage3 NN improved the AUC by including a half depthwise convolution for orbit comparison. Finally, the proposed NN further improved the AUC by separating the parameters of the half depthwise convolution into parts for the left and right orbits.
Figure 1 shows ROC curves representing the diagnostic performance of the oculoplastic specialists. The AUCs were 0.898 for moderate-to-severe GO patients vs. mild GO patients, 0.975 for moderate-to-severe GO patients vs. normal controls, 0.781 for mild GO patients vs. normal controls, and 0.820 for moderate-to-severe GO patients vs. mild GO patients vs. normal controls. These results indicate that the performance of the proposed NN is comparable to that of the oculoplastic specialists for moderate-to-severe GO patients vs. normal controls.
In NN studies, it is important to review the learning curves of the models during training to diagnose learning issues such as overfitting. Figure 2 depicts the learning curves of the proposed NN at each epoch, averaged over 30 experiments. Because learning curves can vary with the test set, we used repeated holdout cross-validation with a random split for each experiment to obtain reliable results. The figure indicates stable learning: the curves increase monotonically and reach their best performance at the 10th epoch without oscillation.

Discussion
A previous study reported that NNs can be applied to the classification of GO patients and to prognosis prediction 16. Our results demonstrate the possibility of applying machine learning techniques to orbital CT images to discriminate patients with GO from normal controls effectively. The diagnosis of GO is usually unambiguous in patients with a history of Graves' disease (GD) and typical clinical features. However, if there are atypical features, CT or magnetic resonance imaging (MRI) is required to rule out other important diagnoses. MRI can accurately reflect changes in soft tissue, but can be more expensive. Criteria for diagnosing GO with CT alone have not yet been established. However, abnormal findings such as extraocular muscle hypertrophy have frequently been reported on CT images of GO patients 7,17. Therefore, if GO can be accurately diagnosed using CT scans, the risk of a missed diagnosis or unnecessary treatment can be reduced. We developed a novel convolutional NN specialized for GO diagnosis; its AUC was highest, at 0.979, for diagnosing moderate-to-severe GO patients in comparison with normal controls, and it was 0.941 for discriminating between patients with moderate-to-severe and mild GO. In addition, the proposed NN demonstrated much higher performance than conventional NNs such as GoogLeNet, ResNet-50, and VGG-16. The performance of the proposed NN was comparable to or even higher than that of oculoplastic specialists, which suggests the possibility of clinical use in ophthalmic practice.
Our proposed NN contained three convolutional layers with three input channels, followed by one fully connected layer and a final classification layer: a sigmoid layer for binary classification or a softmax layer for multiclass classification. Considering that more than 100 CT images in different planes must be analyzed for each patient, the configuration of multiple input channels ensures that all relevant features are available without loss 18. The proposed NN differs from the conventional NNs in two main ways. First, the proposed NN could improve the diagnostic performance by including information processing pipelines for CT images of each plane and combining them, a technique not implemented in the other NNs. Second, the proposed NN involves a novel process of binocular comparison that may be beneficial for diagnosing other orbital diseases. Meanwhile, the features in the proposed NN are extracted with two-dimensional (2D) approaches rather than three-dimensional (3D) techniques. Since CT images are 3D, 2D techniques may lack 3D spatial and volume information that is significant for classification. Nevertheless, most convolutional NNs so far have been designed for natural 2D images 19. A comparison of 3D and 2D convolutional NNs will be necessary in the future. The NNs were less effective in distinguishing patients with mild GO from normal controls than in distinguishing patients with moderate-to-severe GO from normal controls. This result may be due to the limitation of relying solely on orbital CT findings to diagnose mild GO. Although there have been no reports on the diagnostic accuracy of orbital CT alone for mild GO, this approach may not be accurate in diagnosing mild GO with eyelid retraction of less than 2 mm or exophthalmos of less than 3 mm. As such patients are usually diagnosed based on clinical findings rather than orbital CT, an NN using orbital CT images may be less effective.
Nevertheless, the proposed NN showed higher diagnostic performance than the oculoplastic specialists in discriminating patients with mild GO from normal controls. Substantial inconsistency in image assessment among observers has been reported in previous studies, and the classification of CT findings for orbital diseases may also be prone to high intra- and inter-observer variability 20,21. Therefore, it is plausible that NNs can detect subtle and early changes that are difficult for humans to judge visually or consistently. Mild GO patients are often misdiagnosed as normal when their symptoms are not typical, and these patients may benefit most from an NN in the diagnostic process. NNs can be helpful when there is a need to determine whether a patient has mild GO, such as when clinical features are inconclusive or when establishing a Graves' disease treatment policy.
Overfitting, in which a model fits the training data too closely and generalizes poorly to unseen data, is a common issue when training a model with a small dataset. Most medical data are limited in number due to prevalence and acquisition costs, and overfitting was also an issue in our study. To overcome this problem, we reduced the number of features by preprocessing and validated the learning procedure using cross-validation and a learning curve. When these methods were applied, the validation metric improved until a certain number of epochs was reached and then remained constant without decreasing, suggesting that no overfitting occurred. Although we developed an NN, it is difficult to elucidate how the NN classifies orbital CT images as normal or GO. We attempted to evaluate the logic by which the NN works by classifying the feature map patterns, but owing to the black-box nature of NNs, no further tracking was possible. Previous studies have reported that changes in the size of the extraocular muscle or lacrimal gland in orbital CT are important for predicting activity or severity in GO patients 17,22. It is hard to know whether the NN in our study classified orbital images according to changes in the extraocular muscle or lacrimal gland. It may be necessary to infer the diagnostic process of the NN through segmentation of intra-orbital tissues and supervised training. Interpretability is important: if a clinician cannot verify the logical mechanism or approach of an NN, it is difficult to accept computer-aided diagnosis using NNs in clinical practice 19. Therefore, further study is needed to visualize the convolutional layers and filters to form an idea of how machines classify images.
Our study has several limitations. First, there could have been selection bias due to the nature of the datasets, which were selected from a single tertiary hospital. As there is no standardized dataset of orbital CTs for GO, it was necessary to rely on the available hospital data. Further, the normal controls included patients who were evaluated for exophthalmos; therefore, the controls may not have accurately represented the normal population owing to the inclusion of patients with high myopia. However, there was no significant difference in the exophthalmos between mild GO patients and controls. Considering the fact that it is difficult to obtain an orbital CT scan of a normal subject without any ocular symptoms, the controls can be designated as a normal control group. Furthermore, this study was based on only one imaging modality. The current algorithm does not incorporate clinical information. The severity of GO is mainly determined by clinical findings, and there are no criteria for orbital CT diagnosis. Therefore, if other information such as age, sex, and clinical pictures are also used for judgment, further improvements in diagnostic performance can be expected.
In conclusion, we demonstrated the applicability of NNs to the diagnosis of GO and its severity assessment using clinically routine orbital CTs. The proposed NN can reliably distinguish patients with moderate-to-severe GO from normal controls, although it is less effective in discriminating between patients with mild GO and normal controls. The performance of our technique is comparable to that of oculoplastic specialists. This research may contribute to NN-based interpretation of orbital CTs for diagnosing various orbital diseases. The code is publicly available at https://github.com/laymond1/Graves-Orbitopathy-Diagnosis-using-Neural-Network. Further research is necessary to automate orbital CT analysis to develop a diagnosis system that is fully independent of human effort.

Materials and methods
The Institutional Review Board (IRB) of Chung-Ang University Hospital approved this study. This was a retrospective study, and the requirement for informed consent was waived by the IRB of Chung-Ang University Hospital (IRB No. 1912-004-358). This study was conducted in accordance with the ethical standards outlined in the Declaration of Helsinki.
Study participants. The orbital CTs used in this study were obtained from 200 patients with GO and 100 normal controls between December 2010 and December 2018. The GO patients were diagnosed based on Bartley's criteria 23, and the severity of GO (mild or moderate-to-severe) was evaluated according to the severity scale of the European Group on Graves' Orbitopathy (EUGOGO) consensus 24. Patients with mild GO had one or more of the following: minor lid retraction (< 2 mm), mild soft-tissue involvement, exophthalmos < 3 mm above normal, no or intermittent diplopia, and corneal exposure responsive to lubricants. Patients with moderate-to-severe GO were defined as those with more severe clinical features than patients with mild GO but without visual impairment. Normal controls were participants who visited the clinic for exophthalmos evaluation without any disease history and were confirmed to have no disease other than myopia. Exclusion criteria were an age of 18 years or less, previous orbital surgery, axial length greater than 27 mm in either eye, and other orbital pathology that may affect CT findings, including myasthenia gravis and progressive external ophthalmoplegia. Patient data including age, sex, MRD1, and exophthalmometry were recorded for analysis. CT images were obtained with one of the following three scanners: Brilliance CT 64 (Philips Medical Systems, Cleveland, OH, USA), Optima CT660 Freedom Edition (General Electric Medical Systems, Milwaukee, WI, USA), or IQon Elite Spectral CT (Philips Medical Systems, Cleveland, OH, USA). Continuous scanning with a slice thickness of 1.0 mm and a slice increment of 1.0 mm was performed. All images were deidentified prior to transfer to the study investigators. The CT images were jointly evaluated by an ophthalmologist and a radiologist, and images that were incomplete or inconsistent with the clinical findings were excluded.
In total, 288 CT image sets were obtained, including 99 cases of mild GO, 94 of moderate-to-severe GO, and 95 of normal controls.
Data preparation. The obtained CT images in the axial, coronal, and sagittal planes were uploaded to a RadiAnt DICOM viewer (Medixant Co., Poznan, Poland). To overcome the variations caused by differences in CT equipment, spline interpolation was used to fix the number of images in each plane to 32. Then, we manually cropped the region of interest (ROI) and removed the remaining black margin. To meet the fixed-size input requirement of NNs, the CT images were then zoom-interpolated, enlarging the ROI, to 128 × 128 for the axial (128, 128, 32) and sagittal (128, 128, 32) planes, and to 64 × 128 for the coronal plane (64, 128, 32). The CT images were scaled to Hounsfield unit (HU) values, and fat and extraocular muscles (EOMs) were selected in the ranges of −110 to −10 HU and 0 to 40 HU, respectively, to remove unnecessary pixels 17. Finally, all images were normalized by scaling between 0 and 1. Figure 3 shows a schematic overview of our preprocessing steps.
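The HU selection and normalization steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `window_and_normalize`, the choice to set voxels outside both ranges to the lower bound of the fat range, and the fixed scaling limits (−110 and 40 HU) are assumptions made for illustration, since the text does not specify how excluded pixels are filled or which limits are used for scaling. The slice resampling and zoom interpolation steps are not covered here.

```python
import numpy as np

# HU ranges quoted in the text for orbital fat and extraocular muscles (EOMs).
FAT_RANGE = (-110.0, -10.0)
EOM_RANGE = (0.0, 40.0)


def window_and_normalize(volume_hu: np.ndarray) -> np.ndarray:
    """Keep fat/EOM voxels, suppress all others, and scale to [0, 1].

    Assumption (not stated in the paper): voxels outside both ranges are set
    to the lower bound of the fat range, and scaling uses the fixed limits
    -110 HU (mapped to 0) and 40 HU (mapped to 1).
    """
    volume_hu = volume_hu.astype(np.float32)
    in_fat = (volume_hu >= FAT_RANGE[0]) & (volume_hu <= FAT_RANGE[1])
    in_eom = (volume_hu >= EOM_RANGE[0]) & (volume_hu <= EOM_RANGE[1])
    lo, hi = FAT_RANGE[0], EOM_RANGE[1]
    # Voxels that are neither fat nor muscle are replaced by the floor value.
    kept = np.where(in_fat | in_eom, volume_hu, lo)
    return (kept - lo) / (hi - lo)


# Toy example: air, fat boundary, fat, an excluded value, muscle, bone.
toy = np.array([[-1000.0, -110.0, -60.0], [-5.0, 40.0, 300.0]])
scaled = window_and_normalize(toy)
```

With these assumptions, air and bone voxels map to 0, so only fat and muscle contribute non-zero intensities to the network input.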
All 288 cases were separated and combined into four experimental groups: (1) moderate-to-severe GO vs. normal controls, (2) mild GO vs. normal controls, (3) moderate-to-severe GO vs. mild GO, and (4) moderate-to-severe GO vs. mild GO vs. normal controls. Next, each experimental group was represented as an isolated dataset. To mitigate the effects of selection bias due to sex and age, we used holdout cross-validation with a random split regardless of the clinical or demographic characteristics of the participants; for each dataset, 80% of the cases were used as a training set to train the NNs and the remaining 20% were used as a test set to evaluate the trained NNs. The final performances of the proposed NN and existing NNs were measured by averaging the results of 30 repeated experiments.
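The repeated holdout procedure described above can be sketched as follows. This is an illustrative outline, not the published code: the helper names `holdout_split` and `repeated_holdout`, the seeding scheme, and the `evaluate` callback (which would train an NN on the training indices and return a test AUC) are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple


def holdout_split(n_samples: int, test_fraction: float,
                  rng: random.Random) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices into (train, test) without stratification."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_holdout(n_samples: int,
                     evaluate: Callable[[List[int], List[int]], float],
                     repeats: int = 30,
                     test_fraction: float = 0.2,
                     seed: int = 0) -> float:
    """Average a score over `repeats` independent random 80/20 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = holdout_split(n_samples, test_fraction, rng)
        # `evaluate` is a placeholder: train on train_idx, score on test_idx.
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores)


# Group (1): 94 moderate-to-severe GO cases + 95 normal controls = 189 cases.
train_idx, test_idx = holdout_split(189, 0.2, random.Random(1))
```

For the 189 cases of group (1), each split holds out 38 cases for testing and trains on the remaining 151.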

Convolutional NN.
In practice, it is preferable to consider CT images from the axial, coronal, and sagittal planes because they deliver different information for diagnosing GO. However, conventional NNs are designed to accept three-channel inputs such as RGB color images. Although a single image plane may be handled by increasing the number of input channels of existing NNs, the conventional NNs are unable to utilize the three image planes concurrently with their original input layer. To deal with this issue, we designed a new NN that is able to accept three image planes concurrently. Figure 4 shows an overview of the proposed NN. Each cell describes the behavior of the operator and the shapes of the input and output nodes. The proposed NN has three input layers consisting of 32-bit single-precision floating-point elements that take axial (128 × 128 × 32), sagittal (128 × 128 × 32), and coronal images (64 × 128 × 32), which are handled independently before the final fully connected layer. In the proposed NN, first, features are extracted by convolutional and depthwise convolutional layers from the input CT images of up to three different planes. A max pooling layer follows each convolution operation. After the first step, the sizes of the axial and coronal images are reduced to 32 × 32 × 16 and 16 × 32 × 16, respectively. Since only one orbit is included in each sagittal image, the reduced size (32 × 32 × 32) is larger than those of the axial and coronal CT images. Next, feature maps are separately extracted from the left and right orbits of each image to compare the orbits and detect asymmetry or unilateral GO. For this purpose, we use depthwise convolution, where the filter size is set to half of the image size. Each convolutional filter extracts a real value from the separated left or right orbit image. Specifically, 16 convolutional filters are used for each orbit to produce the input values for the subsequent 16 × 2 nodes.
Third, each group of 16 × 2 nodes is flattened to 32 × 1 nodes, which are fully connected to the subsequent 4 × 1 nodes. Finally, the output values of the 4 × 1 nodes are transferred to the output layer. The output layer includes one sigmoid node to
Figure 5 shows a schematic block diagram of the proposed NN. As shown in Fig. 5, the proposed NN is divided into three parts: the convolutional layers, the fully connected layer, and the classifier. The parts up to the fully connected layer are calculated independently for each plane, and after concatenation, the extracted features are combined and used for prediction. In addition, Supplementary Fig. S1 visualizes the learned convolutional [25][26][27]. The NNs were implemented using the TensorFlow (2.1.0) and Keras (2.3.1) APIs, and all experiments were executed on a GTX 1080Ti 11 GB GPU. For a fair comparison, we used the well-known Xavier initialization method, which is supported as the default in the APIs, for all networks 28. Because the conventional NNs can be trained on only one image plane at a time, we report their performance values with CT images of a single plane as the input data. Each sample in the ImageNet dataset has three channels (RGB), whereas each sample of the medical data used in the experiments has 32 input channels; thus, these convolutional networks were trained from scratch. For comparison, three oculoplastic specialists were asked to perform four independent experiments in which they compared three experimental groups with full CT sets without any clinical information, as was done with the proposed and conventional NNs. The final grading was decided by majority vote, and the diagnostic performance in terms of the area under the receiver operating characteristic (ROC) curve was compared with that of the proposed NN.
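The orbit-comparison step described in this section (a depthwise filter covering half of the feature map, with separate parameters for the left and right orbits, followed by flattening to 32 nodes and a fully connected layer with 4 nodes) can be sketched in NumPy. This is a shape-level illustration under stated assumptions, not the released TensorFlow/Keras model: the function names `orbit_comparison` and `dense` are hypothetical, the split into orbits is assumed to run along the width axis of an axial feature map, the weights are random rather than learned, and activations are omitted.

```python
import numpy as np


def orbit_comparison(fmap: np.ndarray, w_left: np.ndarray,
                     w_right: np.ndarray) -> np.ndarray:
    """Half depthwise convolution with separate left/right parameters.

    fmap:            feature map of shape (H, W, C), e.g. (32, 32, 16).
    w_left, w_right: per-channel filters of shape (H, W // 2, C).
    Returns an array of shape (C, 2): one value per channel and orbit.
    """
    half = fmap.shape[1] // 2
    left, right = fmap[:, :half, :], fmap[:, half:, :]
    # Depthwise: each channel has its own filter. Because the filter covers
    # the whole half image, it yields a single scalar per channel.
    left_feat = np.sum(left * w_left, axis=(0, 1))
    right_feat = np.sum(right * w_right, axis=(0, 1))
    return np.stack([left_feat, right_feat], axis=1)


def dense(x: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Plain fully connected layer (activation omitted in this sketch)."""
    return x @ weights + bias


rng = np.random.default_rng(0)
fmap = rng.random((32, 32, 16))           # axial feature map after stage one
w_left = rng.normal(size=(32, 16, 16))    # filters for the left half
w_right = rng.normal(size=(32, 16, 16))   # separate filters for the right half

pair = orbit_comparison(fmap, w_left, w_right)   # 16 x 2 nodes
flat = pair.reshape(-1)                          # flattened to 32 nodes
hidden = dense(flat, rng.normal(size=(32, 4)), np.zeros(4))  # 4 nodes
```

Keeping `w_left` and `w_right` separate mirrors the final ablation stage, in which the parameters of the half depthwise convolution are split between the two orbits.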
Statistical analysis. All statistical analyses were performed using the open-source software R 3.4.0 (R Foundation for Statistical Computing, Vienna, Austria). The data were expressed as averages with standard deviations for the continuous variables and as sample numbers for the categorical variables. The differences in age, sex, and clinical features between the clinical groups were analyzed by one-way analysis of variance with Games-Howell post-hoc analyses and Pearson's chi-square test, respectively. A value of p < 0.05 was considered statistically significant. Receiver operating characteristic (ROC) curve analysis was performed to evaluate the diagnostic performances of the various NNs and those of the oculoplastic specialists in terms of area under the ROC curve (AUC).
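For readers unfamiliar with the AUC metric used throughout, the following sketch computes it for a binary comparison as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. It is only an illustration in Python; the study's analyses were performed in R, and the function name `roc_auc` and the toy data are hypothetical. The three-group comparison would require a multiclass extension that is not shown.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = GO, 0 = normal control; scores are predicted probabilities.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example, three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.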

Data availability
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.